{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 9: Overfitting, underfitting, fitting polynomials, k-fold cross validation\n", "\n", "We will look at [Anscombe's quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet), a set of four constructed datasets that have nearly identical summary statistics, including regression lines, but look very different when graphed." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import statsmodels.formula.api as smf\n", "import seaborn as sns\n", "import numpy as np\n", "from sklearn.model_selection import train_test_split\n", "from sklearn import datasets, linear_model\n", "from sklearn.model_selection import KFold\n", "\n", "\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load Anscombe's quartet from Seaborn and look at it. The four different datasets are indicated by the `dataset` column." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "anscombe = sns.load_dataset(\"anscombe\")\n", "anscombe.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make things easier, create a new dataframe for each of the four datasets." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plot of data\n", "Plot each of the datasets with a regression line using Seaborn. Which regression lines fit the data well?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## R-Squared\n", "\n", "Compute R-Squared for each regression (you will have to fit the linear model using `statsmodels`). What do you notice about the R-Squared values? How do they compare to your visual assessment of the fit in the previous section?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overfitting and Underfitting models\n", "\n", "Overfitting and underfitting refer to the complexity of the model relative to the data. \n", "\n", "First plot Anscombe's second dataset and the regression line again:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset and linear model above are an example of *underfitting* because the model is too simple for the data. The linear model does not capture the curve in the data.\n", "\n", "We can increase the order of a model to increase its complexity. Add the parameter `order = 2` to your plot:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What happened?\n", "\n", "What happens if we increase the order to 3?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overfitting\n", "\n", "Use Seaborn to plot Anscombe's first dataset with the regression line." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What happens if you add the parameter `order = 2`?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What happens if you use `order = 3`?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is an example of overfitting, because the fitted curves are more complex than the data require.\n", "\n", "## k-fold cross validation\n", "\n", "We will use k = 2, because our datasets are so small. We will also do the computations manually.\n", "\n", "Use `train_test_split()` from the last lab to split the third Anscombe dataset in half." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find the linear model for the training data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Compute the mean squared error for the training data predictions:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Now compute the linear model for the test data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compute the mean squared error for the test data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How do the two mean squared errors differ? Does this make sense? 
(You may need to plot the test and training data to answer this question.)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }